1 Librerías

2 El problema general

  • Desde el auge tecnológico de secuenciación del genoma humano…

  • Clasificar la patogenicidad de las mutaciones

  • Causalidad en enfermedades

2.1 Patogenicidad

24 APRIL 2014

2.2 Algoritmos In-silico

3 El problema específico

  • Patogenicidad en TP53
  • Adaptación de las guías de ACMG
  • In-silico tools

3.1 A-GVGD + REVEL + BayesDel

3.1.1 Patogenicidad

3.2 Quantitative model

3.3 ERIC

3.4 Seshat

4 Los datos

4.1 UMD TP53

Se utilizan las mutaciones missense

4.2 dbNSFP4.0b1a

Se utilizan 25 predictores:

4.3 Covariables

##  [1] "SIFT_score"                 "FATHMM_score"              
##  [3] "PROVEAN_score"              "VEST4_score"               
##  [5] "Polyphen2_HDIV_score"       "Polyphen2_HVAR_score"      
##  [7] "MutationAssessor_score"     "LRT_score"                 
##  [9] "fathmm.MKL_coding_score"    "fathmm.XF_coding_score"    
## [11] "MutPred_score"              "PrimateAI_score"           
## [13] "integrated_fitCons_score"   "GM12878_fitCons_score"     
## [15] "H1.hESC_fitCons_score"      "HUVEC_fitCons_score"       
## [17] "GERP.._RS"                  "phyloP100way_vertebrate"   
## [19] "phyloP30way_mammalian"      "phyloP17way_primate"       
## [21] "phastCons100way_vertebrate" "phastCons30way_mammalian"  
## [23] "phastCons17way_primate"     "X29way_logOdds"            
## [25] "bStatistic"

5 Cruce de datos

  • UMD TP53 contiene 80406 mutaciones, de las cuales 57885 son mutaciones missense.

  • Las mutaciones usan la nomenclatura NM_000546.5 (GRCH37, hg19). Utilizando Mutalyzer 2.0.29, se transformaron la coordenadas a NC_000017.11 (GRCH38, hg38)

  • Se tomaron las variaciones únicas y se cruzaron con la base de datos dbNSFP4.0b1a, a diciembre 8 del 2018, para conseguir los 25 predictores (26 variables con REVEL).

  • Se obtuvieron 1764 variantes únicas luego de este cruce.

  • Se perdieron 2 variaciones que no eran verdaderas mutaciones missense. La variable respuesta patogenicidad se obtiene de la base de datos UMD.

6 Análisis descriptivo

6.1 Resumen de patogenicidad

  • Mutaciones missense de UMD (57885).
Patogenicidad Frecuencia
Benign 80 (0.1%)
Possibly pathogenic 9,979 (17.2%)
Likely Pathogenic 5,871 (10.1%)
Pathogenic 38,145 (65.9%)
VUS 3,810 (6.6%)
  • Variantes únicas luego del cruce con dbnsfp (1764).
Patogenicidad Frecuencia
Benign 16 (0.9%)
Possibly pathogenic 465 (26.4%)
Likely Pathogenic 90 (5.1%)
Pathogenic 102 (5.8%)
VUS 1,091 (61.8%)

6.2 Tabla 1

Variable sccontinuas son resumidas por mediana y rango completo

Benign
No. 16
Possibly pathogenic
No. 465
Likely Pathogenic
No. 90
Pathogenic
No. 102
VUS
No. 1,091
Total
No. 1,764
SIFT
  Median (IQR) 0.26 (0.21 - 0.45) 0.00 (0.00 - 0.00) 0.00 (0.00 - 0.00) 0.00 (0.00 - 0.00) 0.06 (0.00 - 0.24) 0.01 (0.00 - 0.12)
  Missing 13 (81.25%) 0 (0%) 0 (0%) 0 (0%) 0 (0%) 13 (0.74%)
FATHMM
  Median (IQR) -5.45 (-5.48 - -5.44) -6.97 (-7.27 - -6.72) -7.29 (-7.41 - -6.99) -7.30 (-7.50 - -7.04) -6.31 (-6.73 - -5.49) -6.69 (-7.05 - -5.56)
  Missing 13 (81.25%) 0 (0%) 0 (0%) 0 (0%) 0 (0%) 13 (0.74%)
PROVEAN
  Median (IQR) -0.14 (-0.18 - 0.15) -4.88 (-6.55 - -2.89) -6.21 (-7.75 - -3.89) -6.17 (-8.43 - -3.94) -1.17 (-3.03 - -0.19) -2.62 (-5.11 - -0.66)
  Missing 13 (81.25%) 0 (0%) 0 (0%) 0 (0%) 0 (0%) 13 (0.74%)
VEST4
  Median (IQR) 0.08 (0.08 - 0.12) 0.82 (0.65 - 0.90) 0.92 (0.81 - 0.96) 0.93 (0.87 - 0.96) 0.36 (0.22 - 0.57) 0.56 (0.30 - 0.82)
  Missing 13 (81.25%) 0 (0%) 0 (0%) 0 (0%) 0 (0%) 13 (0.74%)
Polyphen2 HDIV
  Median (IQR) 0.01 (0.00 - 0.05) 1.00 (0.95 - 1.00) 1.00 (0.99 - 1.00) 1.00 (0.99 - 1.00) 0.16 (0.00 - 0.96) 0.87 (0.02 - 1.00)
  Missing 13 (81.25%) 0 (0%) 0 (0%) 0 (0%) 0 (0%) 13 (0.74%)
Polyphen2 HVAR
  Median (IQR) 0.00 (0.00 - 0.02) 0.99 (0.84 - 1.00) 1.00 (0.96 - 1.00) 0.99 (0.98 - 1.00) 0.19 (0.01 - 0.88) 0.77 (0.06 - 0.99)
  Missing 13 (81.25%) 0 (0%) 0 (0%) 0 (0%) 0 (0%) 13 (0.74%)
MutationAssessor
  Median (IQR) 0.97 (0.96 - 1.16) 2.88 (2.35 - 3.18) 3.16 (2.90 - 3.27) 3.20 (2.93 - 3.29) 1.93 (1.32 - 2.51) 2.36 (1.67 - 2.98)
  Missing 13 (81.25%) 0 (0%) 1 (1.11%) 0 (0%) 3 (0.27%) 17 (0.96%)
LRT
  Median (IQR) 0.36 (0.18 - 0.37) 0.00 (0.00 - 0.00) 0.00 (0.00 - 0.00) 0.00 (0.00 - 0.00) 0.00 (0.00 - 0.13) 0.00 (0.00 - 0.03)
  Missing 13 (81.25%) 0 (0%) 0 (0%) 0 (0%) 10 (0.92%) 23 (1.30%)
fathmm MKL coding
  Median (IQR) 0.05 (0.03 - 0.43) 0.99 (0.96 - 0.99) 0.99 (0.98 - 1.00) 0.99 (0.98 - 0.99) 0.79 (0.18 - 0.97) 0.95 (0.40 - 0.99)
  Missing 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%)
fathmm XF coding
  Median (IQR) 0.06 (0.05 - 0.09) 0.90 (0.64 - 0.95) 0.93 (0.81 - 0.96) 0.94 (0.85 - 0.96) 0.31 (0.15 - 0.68) 0.57 (0.21 - 0.91)
  Missing 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%)
MutPred
  Median (IQR) 0.28 (0.20 - 0.55) 0.85 (0.73 - 0.92) 0.91 (0.86 - 0.96) 0.94 (0.89 - 0.97) 0.47 (0.26 - 0.70) 0.68 (0.38 - 0.86)
  Missing 4 (25.00%) 16 (3.44%) 1 (1.11%) 5 (4.90%) 24 (2.20%) 50 (2.83%)
PrimateAI
  Median (IQR) 0.34 (0.30 - 0.37) 0.60 (0.49 - 0.69) 0.66 (0.60 - 0.72) 0.69 (0.59 - 0.74) 0.44 (0.35 - 0.54) 0.50 (0.39 - 0.63)
  Missing 14 (87.50%) 4 (0.86%) 1 (1.11%) 0 (0%) 58 (5.32%) 77 (4.37%)
integrated fitCons
  Median (IQR) 0.67 (0.64 - 0.67) 0.72 (0.72 - 0.72) 0.72 (0.72 - 0.72) 0.72 (0.72 - 0.72) 0.72 (0.72 - 0.72) 0.72 (0.72 - 0.72)
  Missing 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%)
GM12878 fitCons
  Median (IQR) 0.70 (0.67 - 0.70) 0.70 (0.70 - 0.70) 0.70 (0.70 - 0.70) 0.70 (0.70 - 0.70) 0.70 (0.70 - 0.70) 0.70 (0.70 - 0.70)
  Missing 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%)
H1.hESC fitCons
  Median (IQR) 0.61 (0.61 - 0.61) 0.70 (0.70 - 0.72) 0.70 (0.70 - 0.72) 0.72 (0.70 - 0.72) 0.70 (0.70 - 0.72) 0.70 (0.70 - 0.72)
  Missing 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%)
HUVEC fitCons
  Median (IQR) 0.66 (0.32 - 0.69) 0.74 (0.74 - 0.74) 0.74 (0.74 - 0.74) 0.74 (0.74 - 0.74) 0.74 (0.74 - 0.74) 0.74 (0.74 - 0.74)
  Missing 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%)
GERP++ RS
  Median (IQR) 0.77 (-0.64 - 1.21) 4.62 (3.53 - 5.13) 4.62 (3.64 - 5.28) 4.62 (3.64 - 5.13) 2.92 (0.29 - 4.45) 3.65 (1.40 - 4.71)
  Missing 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%)
phyloP100way vertebrate
  Median (IQR) -0.04 (-0.33 - 0.46) 6.94 (2.62 - 8.18) 7.91 (5.31 - 9.30) 7.91 (5.30 - 7.91) 1.14 (0.13 - 3.79) 2.51 (0.44 - 7.47)
  Missing 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%)
phyloP30way mammalian
  Median (IQR) 0.13 (-0.41 - 0.73) 1.14 (1.02 - 1.30) 1.14 (1.02 - 1.18) 1.03 (1.02 - 1.14) 1.00 (0.11 - 1.14) 1.03 (0.18 - 1.17)
  Missing 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%)
phyloP17way primate
  Median (IQR) 0.47 (-0.20 - 0.67) 0.67 (0.60 - 0.75) 0.67 (0.60 - 0.68) 0.60 (0.60 - 0.66) 0.66 (0.35 - 0.67) 0.66 (0.60 - 0.68)
  Missing 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%)
phastCons100way vertebrate
  Median (IQR) 0.00 (0.00 - 0.02) 1.00 (1.00 - 1.00) 1.00 (1.00 - 1.00) 1.00 (1.00 - 1.00) 0.68 (0.00 - 1.00) 1.00 (0.02 - 1.00)
  Missing 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%)
phastCons30way mammalian
  Median (IQR) 0.22 (0.06 - 0.66) 0.97 (0.83 - 1.00) 0.98 (0.84 - 1.00) 0.98 (0.87 - 1.00) 0.72 (0.03 - 0.98) 0.90 (0.10 - 0.99)
  Missing 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%)
phastCons17way primate
  Median (IQR) 0.97 (0.86 - 0.99) 0.97 (0.74 - 1.00) 0.98 (0.75 - 1.00) 0.98 (0.84 - 1.00) 0.77 (0.21 - 0.99) 0.91 (0.55 - 0.99)
  Missing 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%)
X29way logOdds
  Median (IQR) 4.24 (3.38 - 4.87) 12.31 (9.50 - 13.83) 12.94 (10.68 - 15.36) 13.59 (10.88 - 15.66) 8.18 (5.20 - 11.98) 9.98 (6.46 - 13.06)
  Missing 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7 (0.64%) 7 (0.40%)
bStatistic
  Median (IQR) 439.00 (436.00 - 439.00) 433.00 (433.00 - 434.00) 433.00 (433.00 - 434.00) 433.00 (432.00 - 434.00) 434.00 (433.00 - 434.00) 434.00 (433.00 - 434.00)
  Missing 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%)
REVEL
  Median (IQR) 0.28 (0.26 - 0.35) 0.87 (0.74 - 0.93) 0.91 (0.87 - 0.94) 0.92 (0.88 - 0.95) 0.52 (0.38 - 0.72) 0.69 (0.46 - 0.88)
  Missing 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%) 0 (0%)

6.3 Casos completos

  • Las variantes VUS no se utilizarán para modelar
  • De las 673 variantes restantes cuantas son casos completos (tienen todos los predictores):
Total
No. 673
Complete
No. 632
Incomplete
No. 41
Pathogenicity
  Benign 16 (2.4%) 0 (0.0%) 16 (100.0%)
  Possibly pathogenic 465 (69.1%) 447 (96.1%) 18 (3.9%)
  Likely Pathogenic 90 (13.4%) 88 (97.8%) 2 (2.2%)
  Pathogenic 102 (15.2%) 97 (95.1%) 5 (4.9%)

6.4 Datos faltantes por columna

Variable Faltantes %
MutPred_score 26 3.86
PrimateAI_score 19 2.82
MutationAssessor_score 14 2.08
SIFT_score 13 1.93
FATHMM_score 13 1.93
PROVEAN_score 13 1.93
VEST4_score 13 1.93
Polyphen2_HDIV_score 13 1.93
Polyphen2_HVAR_score 13 1.93
LRT_score 13 1.93

6.5 Bloque de datos faltantes

cDNA_variant HG19_variant HG38_variant position ref alt aaref aaalt Ensembl_transcriptid Ensembl_proteinid Uniprot_acc Uniprot_entry Codon WT_AA_1 WT_AA_3 Mutant_AA_1 Mutant_AA_3 SIFT_score FATHMM_score PROVEAN_score VEST4_score Polyphen2_HDIV_score Polyphen2_HVAR_score MutationAssessor_score LRT_score fathmm.MKL_coding_score fathmm.XF_coding_score MutPred_score PrimateAI_score integrated_fitCons_score GM12878_fitCons_score H1.hESC_fitCons_score HUVEC_fitCons_score GERP.._RS phyloP100way_vertebrate phyloP30way_mammalian phyloP17way_primate phastCons100way_vertebrate phastCons30way_mammalian phastCons17way_primate X29way_logOdds bStatistic REVEL_score Pathogenicity
NM_000546.5:c.139C>T NC_000017.10:g.7579548G>A NC_000017.11:g.7676230G>A 7676230 G A P S ENST00000269305 ENSP00000269305 P04637 P53_HUMAN 47 P Pro S Ser 0.157 -5.51 0.44 0.156 0.009 0.004 0.975 0.000006 0.01973 0.079916 NA 0.2708584 0.722319 0.702456 0.698795 0.735409 -7.900 -3.332 -0.212 -1.044 0.000 0.000 0.001 1.8607 434 0.320 Benign
NM_000546.5:c.215C>G NC_000017.10:g.7579472G>C NC_000017.11:g.7676154G>C 7676154 G C P R ENST00000269305 ENSP00000269305 P04637 P53_HUMAN 72 P Pro R Arg 0.258 -5.45 -0.23 0.083 0.083 0.045 1.355 0.370853 0.36143 0.186080 NA 0.4004812 0.722319 0.702456 0.698795 0.735409 1.870 1.438 1.141 0.672 0.003 0.001 0.001 9.7733 434 0.368 Benign
NM_000546.5:c.91G>A NC_000017.10:g.7579705C>T NC_000017.11:g.7676387C>T 7676387 C T V I ENST00000269305 ENSP00000269305 P04637 P53_HUMAN 31 V Val I Ile 0.635 -5.43 -0.14 0.082 0.001 0.002 0.935 0.362815 0.01403 0.066351 NA NA 0.722319 0.702456 0.698795 0.735409 -4.850 -0.909 -0.690 -0.197 0.000 0.319 0.969 7.7560 434 0.569 Benign
NM_000546.5:c.993+217G>C NC_000017.10:g.7576636C>G NC_000017.11:g.7673318C>G NA NA NA E Q NA NA NA NA 339-beta E Glu Q Gln NA NA NA NA NA NA NA NA 0.69465 0.052691 0.071 NA 0.671770 0.702456 0.596874 0.692231 1.600 0.431 -0.524 -0.229 0.074 0.132 0.969 6.6486 436 0.325 Benign
NM_000546.5:c.993+218A>G NC_000017.10:g.7576635T>C NC_000017.11:g.7673317T>C NA NA NA E G NA NA NA NA 339-beta E Glu G Gly NA NA NA NA NA NA NA NA 0.61753 0.054952 0.078 NA 0.671770 0.702456 0.596874 0.692231 1.510 0.093 -0.368 0.661 0.071 0.205 0.978 4.3668 436 0.254 Benign
NM_000546.5:c.993+220A>G NC_000017.10:g.7576633T>C NC_000017.11:g.7673315T>C NA NA NA N D NA NA NA NA 340-beta N Asn D Asp NA NA NA NA NA NA NA NA 0.67094 0.074269 0.075 NA 0.671770 0.702456 0.596874 0.692231 1.200 0.561 0.879 0.661 0.063 0.717 0.984 4.4007 436 0.269 Benign
NM_000546.5:c.993+223T>G NC_000017.10:g.7576630A>C NC_000017.11:g.7673312A>C NA NA NA C G NA NA NA NA 341-beta C Cys G Gly NA NA NA NA NA NA NA NA 0.85353 0.135663 0.279 NA 0.671770 0.702456 0.596874 0.692231 1.260 0.611 1.011 0.751 0.036 0.928 0.991 4.1073 436 0.228 Benign
NM_000546.5:c.993+281T>G NC_000017.10:g.7576572A>C NC_000017.11:g.7673254A>C NA NA NA L V NA NA NA NA 336-gamma L Leu V Val NA NA NA NA NA NA NA NA 0.11713 0.059946 0.581 NA 0.562547 0.577304 0.608884 0.322989 0.893 0.337 0.106 -0.104 0.002 0.693 0.989 4.0172 439 0.248 Benign
NM_000546.5:c.993+283A>C NC_000017.10:g.7576570T>G NC_000017.11:g.7673252T>G NA NA NA L F NA NA NA NA 336-gamma L Leu F Phe NA NA NA NA NA NA NA NA 0.04800 0.058936 0.544 NA 0.562547 0.577304 0.608884 0.322989 -0.991 0.059 -0.054 0.497 0.001 0.657 0.994 3.1993 439 0.286 Benign
NM_000546.5:c.993+285G>A NC_000017.10:g.7576568C>T NC_000017.11:g.7673250C>T NA NA NA R Q NA NA NA NA 337-gamma R Arg Q Gln NA NA NA NA NA NA NA NA 0.06029 0.057955 0.602 NA 0.562547 0.577304 0.608884 0.322989 0.766 -0.135 0.686 0.450 0.001 0.660 0.996 4.7998 439 0.349 Benign
NM_000546.5:c.993+285G>C NC_000017.10:g.7576568C>G NC_000017.11:g.7673250C>G NA NA NA R P NA NA NA NA 337-gamma R Arg P Pro NA NA NA NA NA NA NA NA 0.08690 0.131111 0.656 NA 0.562547 0.577304 0.608884 0.322989 0.766 -0.135 0.686 0.450 0.001 0.660 0.996 4.7998 439 0.361 Benign
NM_000546.5:c.993+309C>T NC_000017.10:g.7576544G>A NC_000017.11:g.7673226G>A NA NA NA S L NA NA NA NA 345-gamma S Ser L Leu NA NA NA NA NA NA NA NA 0.03685 0.041398 0.291 NA 0.671770 0.702456 0.608884 0.322989 0.893 0.882 0.953 0.672 0.009 0.242 0.900 5.0959 439 0.261 Benign
NM_000546.5:c.993+311T>C NC_000017.10:g.7576542A>G NC_000017.11:g.7673224A>G NA NA NA S P NA NA NA NA 346-gamma S Ser P Pro NA NA NA NA NA NA NA NA 0.02266 0.092732 0.257 NA 0.671770 0.702456 0.608884 0.322989 -0.636 -0.237 0.149 0.751 0.001 0.058 0.858 3.2310 439 0.255 Benign
NM_000546.5:c.993+311T>G NC_000017.10:g.7576542A>C NC_000017.11:g.7673224A>C NA NA NA S A NA NA NA NA 346-gamma S Ser A Ala NA NA NA NA NA NA NA NA 0.03689 0.050213 0.234 NA 0.671770 0.702456 0.608884 0.322989 -0.636 -0.237 0.149 0.751 0.001 0.058 0.858 3.2310 439 0.265 Benign
NM_000546.5:c.993+312C>G NC_000017.10:g.7576541G>C NC_000017.11:g.7673223G>C NA NA NA S W NA NA NA NA 346-gamma S Ser W Trp NA NA NA NA NA NA NA NA 0.04284 0.094164 0.438 NA 0.671770 0.702456 0.608884 0.662433 -0.166 -0.597 -0.919 -0.727 0.001 0.059 0.851 3.4316 439 0.387 Benign
NM_000546.5:c.993+312C>T NC_000017.10:g.7576541G>A NC_000017.11:g.7673223G>A NA NA NA S L NA NA NA NA 346-gamma S Ser L Leu NA NA NA NA NA NA NA NA 0.02373 0.043392 NA NA 0.671770 0.702456 0.608884 0.662433 -0.166 -0.597 -0.919 -0.727 0.001 0.059 0.851 3.4316 439 0.268 Benign

6.6 Predictores

Clases inseparables

6.7 REVEL

Total
No. 1,764
(0,0.5]
No. 542
(0.5,1]
No. 1,222
Pathogenicity
  Benign 16 (0.9%) 15 (93.8%) 1 (6.2%)
  Possibly pathogenic 465 (26.4%) 18 (3.9%) 447 (96.1%)
  Likely Pathogenic 90 (5.1%) 0 (0.0%) 90 (100.0%)
  Pathogenic 102 (5.8%) 0 (0.0%) 102 (100.0%)
  VUS 1,091 (61.8%) 509 (46.7%) 582 (53.3%)

7 Nuevo modelo con los 4 niveles

7.1 Data para modelar

7.2 Imputación

set.seed(17)

data_model_imputed <- rfImpute(
  data = data_model,
  Pathogenicity ~ .,
  iter = 20,
  ntree = 10001
)
## ntree      OOB      1      2      3      4
## 10001:  28.38% 12.50%  6.67% 95.56% 70.59%
## ntree      OOB      1      2      3      4
## 10001:  28.23%  6.25%  6.67% 95.56% 70.59%
## ntree      OOB      1      2      3      4
## 10001:  28.23%  6.25%  6.67% 95.56% 70.59%
## ntree      OOB      1      2      3      4
## 10001:  28.23%  6.25%  6.45% 95.56% 71.57%
## ntree      OOB      1      2      3      4
## 10001:  28.23%  6.25%  6.45% 95.56% 71.57%
## ntree      OOB      1      2      3      4
## 10001:  28.23%  6.25%  6.67% 95.56% 70.59%
## ntree      OOB      1      2      3      4
## 10001:  28.53%  6.25%  7.10% 95.56% 70.59%
## ntree      OOB      1      2      3      4
## 10001:  27.93%  6.25%  6.24% 95.56% 70.59%
## ntree      OOB      1      2      3      4
## 10001:  28.53%  6.25%  6.88% 95.56% 71.57%
## ntree      OOB      1      2      3      4
## 10001:  28.38%  6.25%  6.67% 95.56% 71.57%
## ntree      OOB      1      2      3      4
## 10001:  28.38%  6.25%  6.67% 95.56% 71.57%
## ntree      OOB      1      2      3      4
## 10001:  28.23%  6.25%  6.67% 95.56% 70.59%
## ntree      OOB      1      2      3      4
## 10001:  28.38%  6.25%  6.88% 95.56% 70.59%
## ntree      OOB      1      2      3      4
## 10001:  28.23%  6.25%  6.45% 95.56% 71.57%
## ntree      OOB      1      2      3      4
## 10001:  28.53%  6.25%  6.88% 95.56% 71.57%
## ntree      OOB      1      2      3      4
## 10001:  28.38%  6.25%  6.67% 95.56% 71.57%
## ntree      OOB      1      2      3      4
## 10001:  28.38%  6.25%  6.88% 95.56% 70.59%
## ntree      OOB      1      2      3      4
## 10001:  28.23%  6.25%  6.67% 95.56% 70.59%
## ntree      OOB      1      2      3      4
## 10001:  28.23%  6.25%  6.45% 95.56% 71.57%
## ntree      OOB      1      2      3      4
## 10001:  28.53%  6.25%  6.88% 95.56% 71.57%

7.3 Datos imputados

7.4 Parámetros para Random Forest (RF)

7.4.1 Estimar tiempo de tuning

rf_task <- makeClassifTask(
  data = data_model_imputed, 
  target = "Pathogenicity"
  )
rf_task
## Supervised task: data_model_imputed
## Type: classif
## Target: Pathogenicity
## Observations: 673
## Features:
##    numerics     factors     ordered functionals 
##          25           0           0           0 
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 4
##              Benign Possibly pathogenic   Likely Pathogenic 
##                  16                 465                  90 
##          Pathogenic 
##                 102 
## Positive class: NA
estimateTimeTuneRanger(rf_task, num.trees = 10001)
## Approximated time for tuning: 16M 58S
listMeasures(rf_task)
##  [1] "featperc"         "mmce"             "lsr"             
##  [4] "bac"              "qsr"              "timeboth"        
##  [7] "multiclass.aunp"  "timetrain"        "multiclass.aunu" 
## [10] "ber"              "timepredict"      "multiclass.brier"
## [13] "ssr"              "acc"              "logloss"         
## [16] "wkappa"           "multiclass.au1p"  "multiclass.au1u" 
## [19] "kappa"
set.seed(17)
res <- tuneRanger(
  rf_task, 
  num.trees = 10001,
  iters = 70, 
  iters.warmup = 30,
  measure = list(bac),
  # tune.parameters = c(
  #   "mtry",
  #   "min.node.size",
  #   "sample.fraction"
  #   ),
  parameters = list(
    importance = "permutation", 
    splitrule = "gini",
    replace = F
    # probability = T,
    ),
  show.info = F
  )
res
## Recommended parameter settings: 
##   mtry min.node.size sample.fraction
## 1   17             6       0.4309226
## Results: 
##         bac exec.time
## 1 0.5728293  8.364444
res$model$learner.model
## Ranger result
## 
## Call:
##  ranger::ranger(formula = NULL, dependent.variable = tn, data = getTaskData(.task,      .subset), probability = (.learner$predict.type == "prob"),      case.weights = .weights, ...) 
## 
## Type:                             Probability estimation 
## Number of trees:                  10001 
## Sample size:                      673 
## Number of independent variables:  25 
## Mtry:                             17 
## Target node size:                 6 
## Variable importance mode:         permutation 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.2359279

7.5 Ajuste con Ranger (keep inbag) con los parámetros encontrados

set.seed(17)

# Con probabilidad

model <- ranger(
  data = data_model_imputed,
  Pathogenicity ~ .,
  num.trees = 10001,
  mtry = res$recommended.pars$mtry,
  min.node.size = res$recommended.pars$min.node.size,
  replace=F,
  sample.fraction = res$recommended.pars$sample.fraction,
  importance = res$model$learner.model$importance.mode,
  splitrule = res$model$learner.model$splitrule,
  keep.inbag = TRUE,
  probability = T
  )
model
## Ranger result
## 
## Call:
##  ranger(data = data_model_imputed, Pathogenicity ~ ., num.trees = 10001,      mtry = res$recommended.pars$mtry, min.node.size = res$recommended.pars$min.node.size,      replace = F, sample.fraction = res$recommended.pars$sample.fraction,      importance = res$model$learner.model$importance.mode, splitrule = res$model$learner.model$splitrule,      keep.inbag = TRUE, probability = T) 
## 
## Type:                             Probability estimation 
## Number of trees:                  10001 
## Sample size:                      673 
## Number of independent variables:  25 
## Mtry:                             17 
## Target node size:                 6 
## Variable importance mode:         permutation 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.2360206

7.6 Convergencia con ntree

set.seed(17)

# Sin probabilidad

model2 <- ranger(
  data = data_model_imputed,
  Pathogenicity ~ .,
  num.trees = 10001,
  mtry = res$recommended.pars$mtry,
  min.node.size = res$recommended.pars$min.node.size,
  # replace=F,
  sample.fraction = res$recommended.pars$sample.fraction,
  importance = res$model$learner.model$importance.mode,
  splitrule = res$model$learner.model$splitrule,
  keep.inbag = TRUE
  # probability = T
  )
OOBCurve(
  model2, 
  data= data_model_imputed, 
  task = rf_task,
  measures = list(bac)
  ) -> oobcurve
oobcurve %>%
  add_column(num.trees=1:nrow(.)) %>%
  ggplot(aes(x=num.trees, y=bac))+
  geom_line()+
  ylim(.2,.8)+
  scale_x_continuous(breaks = seq(0,10001,2000))+
  xlab("#Trees")+
  ylab("Balanced accuracy")+
  ggtitle("Target metric convergence")+
  theme(
    # axis.text.x = element_blank(),
    # axis.ticks.x = element_blank(),
    # strip.text = element_text(size = 15),
    plot.title = element_text(size = 24, face = "bold"),
    axis.title=element_text(size=20,face="bold"),
    text = element_text(size=20)
  )

ggsave(
  "figures/article_figures/Figure 1.pdf",
  dpi = 600,
  height = 12,
  width = 18.75,
  # units = "inches"
  )

8 Importancia de las variables


9 OOB confusion matrix

model$predictions %>% head() # OOB
##            Benign Possibly pathogenic Likely Pathogenic Pathogenic
## [1,] 0.000000e+00           0.1655445        0.37631839 0.45813710
## [2,] 0.000000e+00           0.3749095        0.22951097 0.39557948
## [3,] 0.000000e+00           0.3955345        0.26600260 0.33846290
## [4,] 0.000000e+00           0.6323163        0.21921811 0.14846561
## [5,] 2.897543e-05           0.7137720        0.08177156 0.20442745
## [6,] 0.000000e+00           0.9280949        0.05204490 0.01986019
## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            Benign Possibly pathogenic Likely Pathogenic
##   Benign                  15                   0                 0
##   Possibly pathogenic      1                 439                72
##   Likely Pathogenic        0                   8                 5
##   Pathogenic               0                  18                13
##                      Reference
## Prediction            Pathogenic
##   Benign                       0
##   Possibly pathogenic         63
##   Likely Pathogenic            4
##   Pathogenic                  35
## 
## Overall Statistics
##                                           
##                Accuracy : 0.734           
##                  95% CI : (0.6989, 0.7671)
##     No Information Rate : 0.6909          
##     P-Value [Acc > NIR] : 0.008098        
##                                           
##                   Kappa : 0.3196          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Benign Class: Possibly pathogenic
## Sensitivity                0.93750                     0.9441
## Specificity                1.00000                     0.3462
## Pos Pred Value             1.00000                     0.7635
## Neg Pred Value             0.99848                     0.7347
## Prevalence                 0.02377                     0.6909
## Detection Rate             0.02229                     0.6523
## Detection Prevalence       0.02229                     0.8544
## Balanced Accuracy          0.96875                     0.6451
##                      Class: Likely Pathogenic Class: Pathogenic
## Sensitivity                          0.055556           0.34314
## Specificity                          0.979417           0.94571
## Pos Pred Value                       0.294118           0.53030
## Neg Pred Value                       0.870427           0.88962
## Prevalence                           0.133730           0.15156
## Detection Rate                       0.007429           0.05201
## Detection Prevalence                 0.025260           0.09807
## Balanced Accuracy                    0.517486           0.64442

Ligera diferencia con probability = F

t(model2$confusion.matrix) %>%
  confusionMatrix()
## Confusion Matrix and Statistics
## 
##                      true
## predicted             Benign Possibly pathogenic Likely Pathogenic
##   Benign                  15                   0                 0
##   Possibly pathogenic      1                 441                74
##   Likely Pathogenic        0                   7                 3
##   Pathogenic               0                  17                13
##                      true
## predicted             Pathogenic
##   Benign                       0
##   Possibly pathogenic         66
##   Likely Pathogenic            3
##   Pathogenic                  33
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7311          
##                  95% CI : (0.6958, 0.7642)
##     No Information Rate : 0.6909          
##     P-Value [Acc > NIR] : 0.01277         
##                                           
##                   Kappa : 0.3018          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Benign Class: Possibly pathogenic
## Sensitivity                0.93750                     0.9484
## Specificity                1.00000                     0.3221
## Pos Pred Value             1.00000                     0.7577
## Neg Pred Value             0.99848                     0.7363
## Prevalence                 0.02377                     0.6909
## Detection Rate             0.02229                     0.6553
## Detection Prevalence       0.02229                     0.8648
## Balanced Accuracy          0.96875                     0.6353
##                      Class: Likely Pathogenic Class: Pathogenic
## Sensitivity                          0.033333           0.32353
## Specificity                          0.982847           0.94746
## Pos Pred Value                       0.230769           0.52381
## Neg Pred Value                       0.868182           0.88689
## Prevalence                           0.133730           0.15156
## Detection Rate                       0.004458           0.04903
## Detection Prevalence                 0.019316           0.09361
## Balanced Accuracy                    0.508090           0.63550

10 Confusion matrix utilizando todo el RF (overfitting)

prob_estimate <- predict(model, data = data_model_imputed)$predictions

class_prediction <- colnames(prob_estimate)[apply(prob_estimate,1,which.max)]
class_prediction <- ordered(
  class_prediction, 
  levels = c("Benign", "Possibly pathogenic", "Likely Pathogenic", "Pathogenic")
  )
confusionMatrix(
  class_prediction, 
  data_model_imputed$Pathogenicity
  )
## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            Benign Possibly pathogenic Likely Pathogenic
##   Benign                  15                   0                 0
##   Possibly pathogenic      1                 461                39
##   Likely Pathogenic        0                   1                45
##   Pathogenic               0                   3                 6
##                      Reference
## Prediction            Pathogenic
##   Benign                       0
##   Possibly pathogenic         29
##   Likely Pathogenic            1
##   Pathogenic                  72
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8811          
##                  95% CI : (0.8542, 0.9046)
##     No Information Rate : 0.6909          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7221          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Benign Class: Possibly pathogenic
## Sensitivity                0.93750                     0.9914
## Specificity                1.00000                     0.6683
## Pos Pred Value             1.00000                     0.8698
## Neg Pred Value             0.99848                     0.9720
## Prevalence                 0.02377                     0.6909
## Detection Rate             0.02229                     0.6850
## Detection Prevalence       0.02229                     0.7875
## Balanced Accuracy          0.96875                     0.8298
##                      Class: Likely Pathogenic Class: Pathogenic
## Sensitivity                           0.50000            0.7059
## Specificity                           0.99657            0.9842
## Pos Pred Value                        0.95745            0.8889
## Neg Pred Value                        0.92812            0.9493
## Prevalence                            0.13373            0.1516
## Detection Rate                        0.06686            0.1070
## Detection Prevalence                  0.06984            0.1204
## Balanced Accuracy                     0.74828            0.8451

11 OOB probabilities

Benign Possibly pathogenic Likely Pathogenic Pathogenic
0 0.1655 0.3763 0.4581
0 0.3749 0.2295 0.3956
0 0.3955 0.2660 0.3385
0 0.6323 0.2192 0.1485
0 0.7138 0.0818 0.2044
0 0.9281 0.0520 0.0199
model$predictions %>%
# predict(model, data=data_model_imputed)$predictions %>% 
  data.frame() %>% 
  add_column(pathogenicity = data_model_imputed$Pathogenicity) %>%
  data.frame() -> df

df %>% gather("category","prob", Benign:Pathogenic) %>% 
  data.frame %>% 
  mutate(
    category = ordered(
      category,
      levels = c(
        "Benign", 
        "Possibly.pathogenic", 
        "Likely.Pathogenic", 
        "Pathogenic"),
      labels = c(
        "Benign", 
        "Possibly pathogenic", 
        "Likely Pathogenic", 
        "Pathogenic")
  )
  ) %>% 
  ggplot(aes(x = category, y = prob, fill=pathogenicity)) +
  geom_boxplot() +
  scale_fill_manual(
    values=c(
      "Green",
      "Yellow",
      "Orange",
      "Red"
    )
    )+
  facet_grid(.~ pathogenicity) +
  ggtitle("Observed pathogenicity")+
  xlab("Class membership")+
  ylab("Probability")+
  theme(
    # axis.text.x = element_blank(),
    # axis.ticks.x = element_blank(),
    # strip.text = element_text(size = 15),
    plot.title = element_text(size = 24, face = "bold"),
    axis.title=element_text(size=20,face="bold"),
    text = element_text(size=20),
    axis.text.x = element_text(angle = 90, vjust = .3, hjust = .95),
    legend.position = "none"
  )

ggsave(
  "figures/supplementary_figures/Supp. Figure S4.pdf",
  dpi = 600,
  height = 12,
  width = 18.75,
  # units = "inches"
  )
df %>% gather("category","prob", Benign:Pathogenic) %>% 
  data.frame %>% 
  mutate(
    category = ordered(
      category,
      levels = c(
        "Benign", 
        "Possibly.pathogenic", 
        "Likely.Pathogenic", 
        "Pathogenic")
  )) %>% 
  ggplot(aes(x = category, y = prob, color = pathogenicity)) +
  geom_point(position = "jitter") +
  theme(axis.text.x = element_text(angle = 90, vjust = .3, hjust=.95))+
  scale_color_manual(values = c(
    "Benign" = "green",
    "Possibly pathogenic" = "yellow",
    "Likely Pathogenic" = "orange",
    "Pathogenic" = "red")
  )+
  xlab("Class membership")

12 Análisis ROC

12.1 OOB ROC curves by class

df1 <- data.frame(
  ID=1:nrow(data_model_imputed),
  observed=if_else(
    data_model_imputed$Pathogenicity == "Benign",
    1,
    0
  ),
  rf_prob = model$predictions[,1]
    )
df2 <- data.frame(
  ID=1:nrow(data_model_imputed),
  observed=if_else(
    data_model_imputed$Pathogenicity == "Possibly pathogenic",
    1,
    0
  ),
  rf_prob = model$predictions[,2]
    )
df3 <- data.frame(
  ID=1:nrow(data_model_imputed),
  observed=if_else(
    data_model_imputed$Pathogenicity == "Likely Pathogenic",
    1,
    0
  ),
  rf_prob = model$predictions[,3]
    )
df4 <- data.frame(
  ID=1:nrow(data_model_imputed),
  observed=if_else(
    data_model_imputed$Pathogenicity == "Pathogenic",
    1,
    0
  ),
  rf_prob = model$predictions[,4]
    )
roc1 <- roc(
  df1$observed,
  df1$rf_prob
  )
roc2 <- roc(
  df2$observed,
  df2$rf_prob
  )
roc3 <- roc(
  df3$observed,
  df3$rf_prob
  )
roc4 <- roc(
  df4$observed,
  df4$rf_prob
  )
# coords(
#   roc4, 
#   .90, intput = "sens",
#   ret=c("threshold", "specificity", "sensitivity"),
#   , transpose = FALSE
#   )
thresholds <- data.frame(
  thresholds = roc4$thresholds,
  sensitivities = roc4$sensitivities,
  specificities = roc4$specificities
)

thresholds %>% 
  filter(
    sensitivities >= .9
  ) %>% 
  arrange(
    sensitivities,
    -specificities
    ) %>% 
  top_n(1) -> req_cut_off
ggroc(
  list(
    roc1,
    roc2,
    roc3,
    roc4
    ),
    size = 1.2
  )+
  theme_minimal() + 
  labs(
    title = "OOB ROC plot",
    # x = "Specificity",
    y = "Sensibility"
    ) +
  geom_segment(
    aes(x = 1, xend = 0, y = 0, yend = 1), 
    color="grey", 
    linetype="dashed"
    ) +
  # geom_vline(
  #   xintercept = req_cut_off$specificities,
  #   color="grey", 
  #   linetype="dotted",
  #   size = 1.5
  #   ) +
  # geom_hline(
  #   yintercept = req_cut_off$sensitivities,
  #   color="grey", 
  #   linetype="dotted",
  #   size = 1.5
  #   ) +
  geom_point(
    aes(
      x = req_cut_off$specificities[1],
      y = req_cut_off$sensitivities[1],
      color="red",
    ),
    size=20,
    shape = 10
    )+
  scale_color_manual(
    name = "AUC",
    labels = c(
       paste("Benign:",round(roc1$auc,3)),
       paste("Possibly pathogenic:",round(roc2$auc,3)),
       paste("Likely Pathogenic:",round(roc3$auc,3)),
       paste("Pathogenic:",round(roc4$auc,3)),
       paste("Threshold:",round(req_cut_off$thresholds,3))
       ),
    values=c("green", "yellow", "orange", "red", "red"),
    guide = guide_legend(
      override.aes = list(shape = c(rep(NA, 4), 10)))
    )+
  theme(
    # axis.text.x = element_blank(),
    # axis.ticks.x = element_blank(),
    # strip.text = element_text(size = 15),
    plot.title = element_text(size = 34, face = "bold"),
    axis.title=element_text(size=20,face="bold"),
    text = element_text(size=20),
    legend.text=element_text(size=32),
    legend.position = c(0.7, 0.2)
  )

ggsave(
  "figures/article_figures/Figure 3.pdf",
  dpi = 600,
  height = 12,
  width = 18.75,
  # units = "inches"
  )

12.2 In-sample Training ROC curves by class (overfitting)

df1 <- data.frame(
  ID=1:nrow(data_model_imputed),
  observed=if_else(
    data_model_imputed$Pathogenicity == "Benign",
    1,
    0
  ),
  rf_prob = predict(model, data = data_model_imputed)$predictions[,1]
    )
df2 <- data.frame(
  ID=1:nrow(data_model_imputed),
  observed=if_else(
    data_model_imputed$Pathogenicity == "Possibly pathogenic",
    1,
    0
  ),
  rf_prob = predict(model, data = data_model_imputed)$predictions[,2]
    )
df3 <- data.frame(
  ID=1:nrow(data_model_imputed),
  observed=if_else(
    data_model_imputed$Pathogenicity == "Likely Pathogenic",
    1,
    0
  ),
  rf_prob = predict(model, data = data_model_imputed)$predictions[,3]
    )
df4 <- data.frame(
  ID=1:nrow(data_model_imputed),
  observed=if_else(
    data_model_imputed$Pathogenicity == "Pathogenic",
    1,
    0
  ),
  rf_prob = predict(model, data = data_model_imputed)$predictions[,4]
    )
roc1 <- roc(
  df1$observed,
  df1$rf_prob
  )
roc2 <- roc(
  df2$observed,
  df2$rf_prob
  )
roc3 <- roc(
  df3$observed,
  df3$rf_prob
  )
roc4 <- roc(
  df4$observed,
  df4$rf_prob
  )
ggroc(
  list(
    roc1,
    roc2,
    roc3,
    roc4
    )
  )+
  theme_minimal() + 
  labs(
    title = "In-sample ROC plot",
    x = "Specificity",
    y = "Sensibility"
    ) +
  geom_segment(
    aes(x = 1, xend = 0, y = 0, yend = 1), 
    color="grey", 
    linetype="dashed"
    ) +
  theme(legend.position = c(0.7, 0.2))+
  scale_color_discrete(
    name = "AUC", 
    labels = c(
       paste("Benign:",round(roc1$auc,4)),
       paste("Possibly pathogenic:",round(roc2$auc,4)),
       paste("Likely Pathogenic:",round(roc3$auc,4)),
       paste("Pathogenic:",round(roc4$auc,4))
       )
     )

13 Post-hoc tunning Pathogenic

13.1 Alta sensibilidad

Para una sensibilidad requerida del 90% el punto de corte sería 0.07

13.2 TP53MiPaPred_sens rendimiento en datos OOB

Estimación del error fuera de muestra optimista

model$predictions %>% summary()
##      Benign          Possibly pathogenic Likely Pathogenic 
##  Min.   :0.0000000   Min.   :0.01313     Min.   :0.002577  
##  1st Qu.:0.0000000   1st Qu.:0.54143     1st Qu.:0.060108  
##  Median :0.0000000   Median :0.74496     Median :0.117175  
##  Mean   :0.0216053   Mean   :0.67915     Mean   :0.142992  
##  3rd Qu.:0.0000931   3rd Qu.:0.86763     3rd Qu.:0.204811  
##  Max.   :0.9753581   Max.   :0.98100     Max.   :0.592558  
##    Pathogenic       
##  Min.   :0.0008993  
##  1st Qu.:0.0462966  
##  Median :0.0967736  
##  Mean   :0.1562559  
##  3rd Qu.:0.2085338  
##  Max.   :0.8375264
model$predictions %>%
  data.frame() %>% 
  filter(Pathogenic > 0.07) %>% 
  nrow()
## [1] 415
Predicción Frecuencia
Benign 15 (2.2%)
Possibly pathogenic 575 (85.4%)
Likely Pathogenic 17 (2.5%)
Pathogenic 66 (9.8%)
Predicción Frecuencia
Benign 15 (2.2%)
Possibly pathogenic 244 (36.3%)
Likely Pathogenic 0 (0.0%)
Pathogenic 414 (61.5%)
## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            Benign Possibly pathogenic Likely Pathogenic
##   Benign                  15                   0                 0
##   Possibly pathogenic      1                 212                21
##   Likely Pathogenic        0                   0                 0
##   Pathogenic               0                 253                69
##                      Reference
## Prediction            Pathogenic
##   Benign                       0
##   Possibly pathogenic         10
##   Likely Pathogenic            0
##   Pathogenic                  92
## 
## Overall Statistics
##                                           
##                Accuracy : 0.474           
##                  95% CI : (0.4357, 0.5125)
##     No Information Rate : 0.6909          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1978          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Benign Class: Possibly pathogenic
## Sensitivity                0.93750                     0.4559
## Specificity                1.00000                     0.8462
## Pos Pred Value             1.00000                     0.8689
## Neg Pred Value             0.99848                     0.4103
## Prevalence                 0.02377                     0.6909
## Detection Rate             0.02229                     0.3150
## Detection Prevalence       0.02229                     0.3626
## Balanced Accuracy          0.96875                     0.6510
##                      Class: Likely Pathogenic Class: Pathogenic
## Sensitivity                            0.0000            0.9020
## Specificity                            1.0000            0.4361
## Pos Pred Value                            NaN            0.2222
## Neg Pred Value                         0.8663            0.9614
## Prevalence                             0.1337            0.1516
## Detection Rate                         0.0000            0.1367
## Detection Prevalence                   0.0000            0.6152
## Balanced Accuracy                      0.5000            0.6690

13.3 Alto VPP

13.3.1 in_sample (overfitting)

coordinates %>% 
  t() %>% 
  data.frame() %>% 
  ggplot(aes(x=threshold, y = value, color = variable)) + 
  geom_point(aes(y = ppv, col = "ppv")) +
  geom_point(aes(y = sensitivity, col = "sensitivity"))+
  scale_x_continuous(breaks = seq(0,1,.1))+
  scale_y_continuous(breaks = seq(0,1,.1))

13.3.2 OOB

threshold ppv sensitivity
0.6634399 0.6666667 0.0980392
coordinates %>% 
  t() %>% 
  data.frame() %>% 
  ggplot(aes(x=threshold, y = value, color = variable)) + 
    geom_point(aes(y = ppv, col = "ppv")) +
    geom_point(aes(y = sensitivity, col = "sensitivity"))+
    geom_point(
      aes(
        x = max_ppv$threshold,
        y = max_ppv$ppv, 
        col = "red"
        ),
      shape = 10,
      size=20
      )+
    scale_x_continuous(breaks = seq(0,1,.1))+
    scale_y_continuous(breaks = seq(0,1,.1))+
    ggtitle("PPV and sensitivity vs OOB probability of membership to pathogenic class")+
  scale_color_manual(
    name = "",
    labels = c(
      paste0("Cut-off: ", round(max_ppv$threshold,2)),
      paste0("PPV: ", round(max_ppv$ppv,2)),
      paste0("Sensitivity: ", round(max_ppv$sensitivity,2))
    ),
    values=c("indianred", "indianred", "steelblue"),
    guide = guide_legend(
      override.aes = list(
        shape = c(10,20,20)
        # size = 5
        )
      )
    )+
  theme(
    plot.title = element_text(size = 34, face = "bold"),
    axis.title=element_text(size=20,face="bold"),
    text = element_text(size=20),
    legend.text=element_text(size=28)
    # legend.position = c(0.7, 0.2)
  )

ggsave(
  "figures/article_figures/Figure 4.pdf",
  dpi = 600,
  height = 12,
  width = 18.75,
  # units = "inches"
  )

13.4 TP53MiPaPred_ppv rendimiento en datos OOB

Predicción Frecuencia
Benign 15 (2.2%)
Possibly pathogenic 605 (89.9%)
Likely Pathogenic 38 (5.6%)
Pathogenic 15 (2.2%)
## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            Benign Possibly pathogenic Likely Pathogenic
##   Benign                  15                   0                 0
##   Possibly pathogenic      1                 449                78
##   Likely Pathogenic        0                  14                 9
##   Pathogenic               0                   2                 3
##                      Reference
## Prediction            Pathogenic
##   Benign                       0
##   Possibly pathogenic         77
##   Likely Pathogenic           15
##   Pathogenic                  10
## 
## Overall Statistics
##                                          
##                Accuracy : 0.7177         
##                  95% CI : (0.682, 0.7514)
##     No Information Rate : 0.6909         
##     P-Value [Acc > NIR] : 0.0713         
##                                          
##                   Kappa : 0.2316         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: Benign Class: Possibly pathogenic
## Sensitivity                0.93750                     0.9656
## Specificity                1.00000                     0.2500
## Pos Pred Value             1.00000                     0.7421
## Neg Pred Value             0.99848                     0.7647
## Prevalence                 0.02377                     0.6909
## Detection Rate             0.02229                     0.6672
## Detection Prevalence       0.02229                     0.8990
## Balanced Accuracy          0.96875                     0.6078
##                      Class: Likely Pathogenic Class: Pathogenic
## Sensitivity                           0.10000           0.09804
## Specificity                           0.95026           0.99124
## Pos Pred Value                        0.23684           0.66667
## Neg Pred Value                        0.87244           0.86018
## Prevalence                            0.13373           0.15156
## Detection Rate                        0.01337           0.01486
## Detection Prevalence                  0.05646           0.02229
## Balanced Accuracy                     0.52513           0.54464

14 Función para predecir en nuevos datos

La idea es crear una sola función que devuelva las probabilidades estimadas de pertenencia de grupo y la predicción de la clase patogénica del RF normal (la clase con mayor probabilidad), TP53MiPaPred_sens y TP53MiPaPred_ppv. Esta función se aplica en nuevos conjuntos de datos (ej. VUS). El resultado de esta función es un data.frame. De los datos nuevos newdata se predice solo sobre los casos completos para los 25 predictores.

TP53MiPaPred <- function(newdata){
  
  # Probabilidades
  prediction <- predict(model, data = na.omit(newdata))$predictions
  
  # Clase predicha con el RF normal
  prediction %>% 
    as.tibble() %>%
    mutate(
      default_rf = colnames(.)[apply(.,1,which.max)],
      default_rf = ordered(default_rf,
      levels=c("Benign",
               "Possibly pathogenic",
               "Likely Pathogenic",
               "Pathogenic")
      )
    ) -> prediction
  
  # Clase predicha con TP53MiPaPred_sens
  prediction$sens_model <- ifelse(
    prediction$default_rf == "Benign", "Benign",
      ifelse(
        prediction[,4] > 0.07, "Pathogenic",
        colnames(prediction[,2:3])[apply(prediction[,2:3],1,which.max)]
  )
)

  prediction$sens_model <- ordered(
      prediction$sens_model,
      levels=c("Benign",
               "Possibly pathogenic",
               "Likely Pathogenic",
               "Pathogenic")
      )
  
  # Clase predicha con TP53MiPaPred_ppv
  prediction$ppv_model <- ifelse(
    prediction$default_rf == "Benign", "Benign",
      ifelse(
        prediction[,4] > max_ppv$threshold, "Pathogenic",
        colnames(prediction[,2:3])[apply(prediction[,2:3],1,which.max)]
  )
)

  prediction$ppv_model <- ordered(
      prediction$ppv_model,
      levels=c("Benign",
               "Possibly pathogenic",
               "Likely Pathogenic",
               "Pathogenic")
      )
  
  return(prediction)
}

15 Variantes VUS

15.1 Datos faltantes

De las 1091 variantes VUS 994 (91%) son casos completos (sin datos faltantes)

Variable Faltantes %
PrimateAI_score 58 5.32
MutPred_score 24 2.20
LRT_score 10 0.92
X29way_logOdds 7 0.64
MutationAssessor_score 3 0.27

16 Aplicación de los modelos a las VUS

vus_pred <- TP53MiPaPred(na.omit(vus_variants))
summary(vus_pred)
##      Benign          Possibly pathogenic Likely Pathogenic  
##  Min.   :0.0000000   Min.   :0.1121      Min.   :0.0008932  
##  1st Qu.:0.0002487   1st Qu.:0.6772      1st Qu.:0.0321051  
##  Median :0.0088325   Median :0.8452      Median :0.0628154  
##  Mean   :0.0809039   Mean   :0.7705      Mean   :0.0942466  
##  3rd Qu.:0.1050374   3rd Qu.:0.9236      3rd Qu.:0.1158526  
##  Max.   :0.6631587   Max.   :0.9932      Max.   :0.4047845  
##    Pathogenic                     default_rf                sens_model 
##  Min.   :0.002988   Benign             : 45   Benign             : 45  
##  1st Qu.:0.022034   Possibly pathogenic:939   Possibly pathogenic:723  
##  Median :0.040404   Likely Pathogenic  :  7   Likely Pathogenic  :  0  
##  Mean   :0.054380   Pathogenic         :  3   Pathogenic         :226  
##  3rd Qu.:0.068714                                                      
##  Max.   :0.457839                                                      
##                ppv_model  
##  Benign             : 45  
##  Possibly pathogenic:942  
##  Likely Pathogenic  :  7  
##  Pathogenic         :  0  
##                           
## 

16.1 Probabilidades de pertenencia de grupo

16.2 RF normal

vus_pred %>%
  pull(default_rf) %>% 
  describeFactors() %>% 
  data.frame() %>% 
  rownames_to_column() %>% 
  remove_rownames() %>% 
  kable(col.names = c("Predicción", "Frecuencia"), format = "html") %>% 
  kable_styling()
Predicción Frecuencia
Benign 45 (4.5%)
Possibly pathogenic 939 (94.5%)
Likely Pathogenic 7 (0.7%)
Pathogenic 3 (0.3%)

16.3 TP53MiPaPred_sens

vus_pred %>%
  pull(sens_model) %>% 
  describeFactors() %>% 
  data.frame() %>% 
  rownames_to_column() %>% 
  remove_rownames() %>% 
  kable(col.names = c("Predicción", "Frecuencia"), format = "html") %>% 
  kable_styling()
Predicción Frecuencia
Benign 45 (4.5%)
Possibly pathogenic 723 (72.7%)
Likely Pathogenic 0 (0.0%)
Pathogenic 226 (22.7%)
vus_pred %>% 
  select(Benign:Pathogenic,sens_model) %>% 
  gather("category","prob", Benign:Pathogenic) %>% 
  ggplot(aes(x = category, y = prob)) +
  geom_boxplot() +
  facet_grid(.~ sens_model) +
  theme(axis.text.x = element_text(angle = 90, vjust = .5))+
  xlab("Membership probability")

16.4 TP53MiPaPred_ppv

vus_pred %>%
  pull(ppv_model) %>% 
  describeFactors() %>% 
  data.frame() %>% 
  rownames_to_column() %>% 
  remove_rownames() %>% 
  kable(col.names = c("Predicción", "Frecuencia"), format = "html") %>% 
  kable_styling()
Predicción Frecuencia
Benign 45 (4.5%)
Possibly pathogenic 942 (94.8%)
Likely Pathogenic 7 (0.7%)
Pathogenic 0 (0.0%)
vus_pred %>% 
  select(Benign:Pathogenic,ppv_model) %>% 
  gather("category","prob", Benign:Pathogenic) %>% 
  ggplot(aes(x = category, y = prob)) +
  geom_boxplot() +
  facet_grid(.~ ppv_model) +
  theme(axis.text.x = element_text(angle = 90, vjust = .5))+
  xlab("Membership probability")

17 Exportación de datos imputados, Modelo Ranger, Función de predicción y VUS con predicciones

17.1 Datos imputados

data %>% 
  filter(Pathogenicity!="VUS") %>% 
  select(cDNA_variant:Mutant_AA_3) %>% 
  cbind(data_model_imputed) %>% 
  openxlsx::write.xlsx("data_model_imputed.xlsx")

17.2 Modelo ranger y función de predicción

Para predecir sobre nuevos datos está disopnible el modelo implementado con el paquete ranger y la función de predicción expuesta arriba.

model$predictions contiene las predicciontes OOB sobre los datos imputados.

save(list = c("TP53MiPaPred","model"), file = "load_to_predict.RData")

# require(ranger)
# load("load_to_predict.RData")

17.3 VUS con predicciones

Sobre las variantes VUS con predictores completos

data %>% 
  filter(Pathogenicity=="VUS") %>% 
  select(cDNA_variant:bStatistic) %>% 
  na.omit() -> vus_variants_na_omit

data.frame(
  vus_variants_na_omit,
  TP53MiPaPred(vus_variants_na_omit)
) %>% 
  openxlsx::write.xlsx("vus_predictions.xlsx")